Chapter 1: Introduction

Customer churn, also known as customer turnover, occurs when a customer chooses to leave or unsubscribe from a service for any reason. In this project, we look at customer churn in banking. Predicting customer churn is important because banks want to retain as many customers as they can (Kaemingk, 2018).

Retention matters for more than its own sake: it is also cheaper to retain existing customers than to spend money acquiring new customers to replace the ones that left. Lower customer acquisition costs mean higher profits. By identifying customers at risk of churn, we can decide which retention efforts to pursue in order to maximize their likelihood of staying (Guliyev & Yerdelen Tatoğlu, 2021).

We chose to analyze a multinational bank’s churn for several reasons. As an international team, we were interested in topics that spanned multiple countries, with a particular focus on either business or environment, our personal interests. When we were pitching topics, nearly every suggestion was geared towards sales or customer experience, or global environmental issues.

The aim of this project is to identify the most useful model for predicting churn across three countries serviced by a multinational bank.

Customer Analysis Overview

From our EDA in the midterm, we found that more than 50% of customers are based in France. The average customer age is 39, and most customers have been with the bank for more than a year. About 51% of customers are active, male customers outnumber female customers, and 71% of customers use a credit card. The average credit score is 650. The analysis also showed that 20% of customers had churned, so we decided to investigate which factors significantly affect customer churn at the multinational bank. The model identified can then inform the bank which customer activity trends and attributes to monitor, so that it can make efforts to retain customers before an account is closed.

Chapter 2: Description of the Data

This study is based on a data set of 10,000 observations of bank customers with 12 variables, which we obtained from Kaggle (Topre, 2022). The structure of the data set and its variable names are shown below, followed by a description of each variable.

## 'data.frame':    10000 obs. of  12 variables:
##  $ customer_id     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ credit_score    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ country         : chr  "France" "Spain" "France" "France" ...
##  $ gender          : chr  "Female" "Female" "Female" "Female" ...
##  $ age             : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ tenure          : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ balance         : num  0 83808 159661 0 125511 ...
##  $ products_number : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ credit_card     : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  101349 112543 113932 93827 79084 ...
##  $ churn           : int  1 0 1 0 0 1 0 1 0 0 ...

  1. Customer ID - The unique ID of each individual customer
  2. Credit Score - A number depicting the customer’s creditworthiness
  3. Country - The country the customer banks from
  4. Gender - The gender the customer identifies with
  5. Age - The customer’s age
  6. Tenure - The length of time, in years, the customer has been with the bank
  7. Balance - The amount currently available in the customer’s account
  8. Products Number - The number of products purchased by the customer through the bank
  9. Credit Card - Indicates whether the customer has a credit card
  10. Active Member - Indicates whether the customer is active or inactive
  11. Estimated Salary - The bank’s estimate of the customer’s income
  12. Churn - Indicates whether the customer has left the bank

Chapter 3: Independent Variables EDA Recap

Cleaning the Data

In preparation for exploratory data analysis, we took several steps to clean the data. We immediately dropped the customer_id variable, as we do not need the account holder’s unique identifier for our purpose. After dropping customer_id our dataset had 11 variables:

## 'data.frame':    10000 obs. of  11 variables:
##  $ credit_score    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ country         : chr  "France" "Spain" "France" "France" ...
##  $ gender          : chr  "Female" "Female" "Female" "Female" ...
##  $ age             : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ tenure          : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ balance         : num  0 83808 159661 0 125511 ...
##  $ products_number : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ credit_card     : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  101349 112543 113932 93827 79084 ...
##  $ churn           : int  1 0 1 0 0 1 0 1 0 0 ...

Next, we checked the data set for duplicate records and null values; none were found. We then converted the following variables into categorical variables: credit_card, active_member, churn, gender, tenure, products_number, and age. We also converted the Boolean values (0, 1) into character labels for credit_card (Credit Card, No-Credit Card), active_member (Active, In Active), and churn (Churned, Retained), so that each variable is easier to read during plotting and analysis.

credit_card      active_member   churn
Credit Card      Active          Churned
No-Credit Card   Active          Retained
Credit Card      In Active       Churned
No-Credit Card   In Active       Retained
Credit Card      Active          Retained
Credit Card      In Active       Churned

Finally, we checked our continuous variables for outliers using the outlierKD function.

## Outliers identified: 15 
## Proportion (%) of outliers: 0.2 
## Mean of the outliers: 361 
## Mean without removing outliers: 651 
## Mean if we remove outliers: 651 
## Outliers successfully removed

## Outliers identified: 359 
## Proportion (%) of outliers: 3.7 
## Mean of the outliers: 69.3 
## Mean without removing outliers: 38.9 
## Mean if we remove outliers: 37.8 
## Outliers successfully removed

Using the outlierKD function, we observed that outliers were present only in the age and credit_score variables (age: 3.7%, credit_score: 0.2%), so we decided to remove the outliers from these two variables using the same function.
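As a rough illustration of how such an outlier check works, the sketch below flags values that lie more than 1.5 × IQR beyond the quartiles, which is the usual boxplot rule. It is written in plain Python rather than the R used in this project, and it assumes that outlierKD follows the boxplot rule; the function name iqr_outliers and the sample ages are hypothetical, and the quartile method may differ slightly from R's.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical ages for illustration, not taken from the data set.
ages = [27, 29, 39, 41, 42, 42, 43, 44, 44, 50, 92]
print(iqr_outliers(ages))
```

Values on the fence itself (here, an age of exactly 50) are kept; only values strictly beyond the fences are flagged.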

Customer Analysis Review

In this section, we analyze each variable in the bank churn data set using plots, together with the mean, standard deviation, and percentages calculated for each variable.

What is the average credit score of the customers?

The average credit score is 650.529, with a standard deviation of 96.953; most customers have a credit score between 600 and 700. The histogram shows the range of credit scores.

In which countries do customers hold a bank account?

The customers are grouped by the countries in which they hold their accounts. As the plot shows, France has more than 50% of customer accounts, the highest share of the three countries, with Germany and Spain holding equal percentages. The bar plot represents the number of customers in the three countries.

What is the average age of the customers?

The majority of the bank’s customers are below the age of 50, with an average age of 39 and a standard deviation of 10.5.

How many years have customers had a relationship with the bank?

Most of the customers have been with the bank for more than a year.

What is the percentage of Male and Female customers who hold an account in the bank?

The bank’s customers are predominantly male: men make up 55% of the customer base, and women make up the remaining 45%.

What is the percentage of Active and Inactive account holders in the bank?

48.5% of customers are inactive; the remaining 51.5% are active.

How many of the bank’s products and services do customers purchase?

About 50% of customers hold one product, which is the most common case; holding four products is the least common, at 0.6%.

What percentage of customers make use of credit cards?

71% of the bank’s customers use a credit card, and only 29% do not.

What percentage of customers have churned from the bank?

The bank managed to retain 80% of its customers; the remaining 20% churned.

The key question, then, is which factors drove the 20% churn rate at the multinational bank across the different countries. We used chi-squared tests, two-sample t-tests, and correlation plots to find which factors influenced customer churn:

Variables’ Influence on Churn

## 
##  Pearson's product-moment correlation
## 
## data:  churn_data$balance and as.numeric(churn_data$churn)
## t = 12, df = 9998, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0992 0.1378
## sample estimates:
##   cor 
## 0.119

## [1] -0.0271

## [1] 0.285
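The t statistic in the Pearson output above can be reproduced from the correlation coefficient and the sample size alone. The sketch below is plain Python rather than the project's R; it uses the standard formula t = r·sqrt(n − 2) / sqrt(1 − r²), and the function name correlation_t_statistic is hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """Return (t, df) for testing that a Pearson correlation is zero."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t, df


# Values taken from the output above: cor = 0.119 with 10,000 observations.
t, df = correlation_t_statistic(0.119, 10_000)
print(round(t, 1), df)
```

With 9,998 degrees of freedom, even a weak correlation of 0.119 gives a very large t statistic, which is why the p-value is so small despite the modest effect size.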

Chapter 4: Handling Imbalanced Data

In our EDA, we found that our binary target variable, churn, was imbalanced, with 7,963 instances of 0/Retained customers and 2,037 instances of 1/Churned customers.

With 79.6% of customers 0/Retained and 20.4% of customers 1/Churned, we needed to balance our data set and bring it closer to a 50/50 split for the purpose of our training set. We used the ovun.sample function from the ROSE package to undersample the data, resulting in a more even split. With the new proportions, we are ready to perform the train-test split and proceed.
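To illustrate the idea of undersampling, the sketch below keeps every case from the minority class and draws an equally sized random subset of the majority class. It is plain Python, not the ROSE implementation used in this project, so the resulting counts need not match what ovun.sample produced; the function name undersample and the seed are hypothetical.

```python
import random


def undersample(labels, seed=0):
    """Return sorted indices keeping all minority cases and an
    equally sized random subset of the majority cases."""
    rng = random.Random(seed)
    zeros = [i for i, y in enumerate(labels) if y == 0]
    ones = [i for i, y in enumerate(labels) if y == 1]
    minority, majority = sorted((zeros, ones), key=len)
    kept = minority + rng.sample(majority, len(minority))
    return sorted(kept)


# Class counts from the text: 7,963 retained (0) and 2,037 churned (1).
labels = [0] * 7963 + [1] * 2037
kept = undersample(labels)
balanced = [labels[i] for i in kept]
print(len(balanced), sum(balanced) / len(balanced))
```

The trade-off is that the balanced sample is much smaller than the original data, because most of the majority class is discarded.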

Chapter 5: SMART Questions

Linear Regression

1. Does the combination of the customer demographic variables country, age group, and gender influence the churn rate of the customers?

2. Do fewer products, a higher salary, the account balance, and the status of the account affect the churn rate?

Finding the Best Model

3. What are the principal components for predicting churn rate?

Before running PCA, we need to ensure that the data is scaled and that all variables are of numeric or integer type.

## 'data.frame':    10000 obs. of  11 variables:
##  $ credit_score    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ country         : chr  "France" "Spain" "France" "France" ...
##  $ gender          : chr  "Female" "Female" "Female" "Female" ...
##  $ age             : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ tenure          : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ balance         : num  0 83808 159661 0 125511 ...
##  $ products_number : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ credit_card     : chr  "Credit Card" "No-Credit Card" "Credit Card" "No-Credit Card" ...
##  $ active_member   : chr  "Active" "Active" "In Active" "In Active" ...
##  $ estimated_salary: num  101349 112543 113932 93827 79084 ...
##  $ churn           : chr  "1" "0" "1" "0" ...
## 'data.frame':    10000 obs. of  9 variables:
##  $ country         : num  0 2 0 0 2 2 0 1 0 0 ...
##  $ gender          : num  1 1 1 1 1 0 0 1 0 0 ...
##  $ age             : int  39 43 44 27 31 24 34 25 35 45 ...
##  $ tenure          : int  1 2 4 2 6 3 10 5 7 3 ...
##  $ balance         : num  0 125511 142051 134604 102017 ...
##  $ products_number : int  2 1 2 1 2 2 2 2 2 2 ...
##  $ credit_card     : num  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : num  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  93827 79084 74940 71726 80181 ...
## [1] "Case: z-score/scaled"
## Importance of components:
##                          PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8    PC9
## Standard deviation     1.102 1.025 1.014 1.004 1.000 0.994 0.984 0.971 0.8933
## Proportion of Variance 0.135 0.117 0.114 0.112 0.111 0.110 0.108 0.105 0.0887
## Cumulative Proportion  0.135 0.252 0.366 0.478 0.589 0.699 0.806 0.911 1.0000
##                      PC1     PC2      PC3     PC4     PC5     PC6     PC7
## country           0.0174 -0.2169  0.49430 -0.2268 -0.3847 -0.6518 -0.0939
## gender            0.0400  0.3537  0.00235 -0.6836 -0.2356  0.4357 -0.1900
## age              -0.2528  0.0725  0.42284  0.0360  0.4164  0.0649 -0.7364
## tenure            0.0315 -0.3271 -0.63071 -0.0852  0.2680 -0.2529 -0.3865
## balance          -0.6766  0.0141 -0.14577 -0.0524 -0.1560 -0.0168  0.0784
## products_number   0.6843 -0.0490  0.05566 -0.0364  0.0466  0.0481 -0.1302
## credit_card       0.0585  0.2443 -0.10594  0.6170 -0.5775  0.1111 -0.3936
## active_member    -0.0495 -0.5350  0.35024  0.2136  0.1022  0.4981  0.1603
## estimated_salary -0.0339 -0.6059 -0.14432 -0.2058 -0.4252  0.2328 -0.2445
##                       PC8      PC9
## country          -0.27396 -0.01332
## gender           -0.35012  0.04379
## age               0.17408 -0.02363
## tenure           -0.44910  0.00870
## balance          -0.00757 -0.69798
## products_number   0.11156 -0.70069
## credit_card      -0.21035  0.00847
## active_member    -0.50479 -0.06333
## estimated_salary  0.50872  0.12266

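From the proportion of variance, we can see that the first eight principal components explain about 91% of the variance, and all nine are needed to explain 100%. No single component dominates: each accounts for roughly 9% to 14% of the variance.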
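The proportion-of-variance rows in the output above follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The sketch below reproduces that arithmetic in plain Python (not the R used in this project), using the standard deviations copied from the output.

```python
# Standard deviations of the nine components, copied from the output above.
sdev = [1.102, 1.025, 1.014, 1.004, 1.000, 0.994, 0.984, 0.971, 0.8933]

variances = [s ** 2 for s in sdev]
total = sum(variances)
proportion = [v / total for v in variances]

# Running total of the proportions gives the cumulative row.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for k, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{k}: proportion={p:.3f} cumulative={c:.3f}")
```

Because the variables were scaled, the total variance is close to 9, the number of variables, so a component with a standard deviation near 1 carries about one ninth of the variance.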

Feature Selection

4. Which model gives the best results based on the adjusted R^2 value, along with lower BIC and Cp?
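Two of the criteria named in this question can be written out as short formulas. The sketch below uses the textbook definitions of adjusted R^2 and Mallows' Cp in plain Python (not the R used in this project); the function names and the example numbers are hypothetical, and BIC is left out because packages scale it in different ways.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def mallows_cp(rss_sub, rss_full, n, p_sub, p_full):
    """Mallows' Cp of a submodel, using the full model's error variance."""
    sigma2 = rss_full / (n - p_full - 1)
    return rss_sub / sigma2 - n + 2 * (p_sub + 1)


# Hypothetical values for illustration only.
print(adjusted_r2(r2=0.14, n=10_000, p=2))
print(mallows_cp(rss_sub=1700.0, rss_full=1690.0, n=10_000, p_sub=2, p_full=10))
```

Adjusted R^2 rewards fit but penalizes extra predictors, so higher is better; for Cp, lower is better, and a model with little bias has a Cp close to its number of parameters (predictors plus intercept).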

Feature selection using Exhaustive Search

Using exhaustive search, age and balance were selected, which is a 2-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_score, age, balance, which is a 3-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary, which is a 5-variable model with a Cp value of 7.

##                  Abbreviation
## credit_score           crdt_s
## country                    cn
## gender                      g
## age                        ag
## tenure                      t
## balance                     b
## products_number             p
## credit_card            crdt_c
## active_member              a_
## estimated_salary            e

The best model selected using the adjusted R^2 plot is credit_score, gender, age, country, tenure, balance, credit_card, active_member, which is an 8-variable model with an R^2 value of 0.1440.

The Mallows’ Cp plot selected four best models, with a highest Cp value of 9.5:

  1. credit_score, country, gender, age, tenure, balance, credit_card, active_member, estimated_salary (9-variable model)
  2. credit_score, country, gender, age, tenure, balance, active_member, estimated_salary (8-variable model)
  3. credit_score, country, gender, age, tenure, balance, estimated_salary (7-variable model)
  4. credit_score, gender, tenure, balance, estimated_salary (5-variable model)

Feature selection using forward search

The best model selected using forward search is age, balance, which is a 2-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_card, age, balance, which is a 3-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary, which is a 5-variable model with a Cp value of 7.

Feature selection using backward search

The best model selected using backward search is age, balance, which is a 2-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_card, age, balance, which is a 3-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary, which is a 5-variable model with a Cp value of 7.

Feature selection using Sequential Replacement

The best model selected using sequential replacement is age, balance, which is a 2-variable model with an adjusted R^2 value of 0.14.

The best model selected using BIC is credit_card, age, balance, estimated_salary, which is a 4-variable model with a BIC value of -1500.

The best model selected using Cp is credit_score, age, tenure, balance, estimated_salary, which is a 5-variable model with a Cp value of 7.

The best models selected using the feature selection methods are:

1. age, balance (2-variable model)

  • Forward search: Adj R^2
  • Exhaustive search: Adj R^2
  • Backward search: Adj R^2
  • Sequential search: Adj R^2

2. credit_score, age, balance (3-variable model)

  • Forward search: BIC
  • Exhaustive search: BIC
  • Backward search: BIC
  • Sequential search: BIC

3. credit_score, age, tenure, balance, estimated_salary (5-variable model)

  • Exhaustive search: Cp
  • Forward search: Cp
  • Backward search: Cp
  • Sequential search: Cp

4. credit_score, gender, age, balance, estimated_salary (5-variable model)

  • Mallows’ Cp plot

5. credit_score, country, gender, age, tenure, balance, estimated_salary (7-variable model)

  • Mallows’ Cp plot

6. credit_score, country, gender, age, tenure, balance, active_member, estimated_salary (8-variable model)

  • Mallows’ Cp plot

7. credit_score, country, gender, age, tenure, balance, credit_card, active_member, estimated_salary (9-variable model)

  • Mallows’ Cp plot
  • Adj R^2

Chapter 6: Comparing the Models Chosen in Feature Selection

Methods of Model Evaluation

When evaluating the models, we focus particularly on false negatives, because it is vital for a bank to predict churn correctly, ideally without any errors. False negatives matter most because we do not want to predict that customers are retained when they have in fact churned.

The second way we evaluate the models is by the area under the ROC curve (AUC). The higher this number, the better the model.
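To make the role of false negatives concrete, the sketch below derives the usual summary rates from a 2x2 confusion matrix, treating 1/Churned as the positive class. It is plain Python rather than the R used in this project; the function name classification_metrics is hypothetical, and the counts are those reported for Model 1 later in this chapter.

```python
def classification_metrics(tn, fp, fn, tp):
    """Summarise a 2x2 confusion matrix (1 = churned is the positive class)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),               # share of churners caught
        "false_negative_rate": fn / (tp + fn),  # share of churners missed
        "precision": tp / (tp + fp),            # share of churn flags that are right
    }


# Counts reported for Model 1 later in this chapter.
m = classification_metrics(tn=4542, fp=1025, fn=1994, tp=2439)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Recall and the false negative rate always sum to 1, so minimizing false negatives is the same as maximizing the share of actual churners that the model catches.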

Model with all variables:

## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.523  -0.944  -0.630   1.058   2.237  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.10e+00   1.99e-01  -15.54   <2e-16 ***
## credit_score     -6.01e-04   2.22e-04   -2.71   0.0068 ** 
## country          -1.83e-02   2.65e-02   -0.69   0.4895    
## gender           -5.48e-02   4.40e-02   -1.25   0.2127    
## age               7.70e-02   2.28e-03   33.68   <2e-16 ***
## tenure           -1.72e-02   7.50e-03   -2.29   0.0218 *  
## balance           4.82e-06   3.64e-07   13.24   <2e-16 ***
## products_number  -9.63e-02   3.38e-02   -2.85   0.0044 ** 
## credit_card      -1.50e-02   4.80e-02   -0.31   0.7554    
## active_member     4.65e-02   4.38e-02    1.06   0.2889    
## estimated_salary -5.47e-07   3.81e-07   -1.43   0.1516    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13734  on 9999  degrees of freedom
## Residual deviance: 12107  on 9989  degrees of freedom
## AIC: 12129
## 
## Number of Fisher Scoring iterations: 4

Model 1 - churn ~ age + balance

## 
## Call:
## glm(formula = churn ~ age + balance, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.181  -0.373  -0.176   0.445   1.022  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.19e-01   1.88e-02   -17.0   <2e-16 ***
## age          1.65e-02   4.24e-04    39.0   <2e-16 ***
## balance      1.05e-06   7.40e-08    14.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2467.9  on 9999  degrees of freedom
## Residual deviance: 2098.5  on 9997  degrees of freedom
## AIC: 12773
## 
## Number of Fisher Scoring iterations: 2

Confusion matrix from Logit Model-1

           Predicted 0   Predicted 1   Total
Actual 0          4542          1025    5567
Actual 1          1994          2439    4433
Total             6536          3464   10000

## Area under the curve: 0.747

Model 2 - churn ~ credit_card + age + balance

## 
## Call:
## glm(formula = churn ~ credit_card + age + balance, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.180  -0.373  -0.175   0.445   1.023  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.17e-01   2.01e-02  -15.74   <2e-16 ***
## credit_card -2.91e-03   1.01e-02   -0.29     0.77    
## age          1.65e-02   4.24e-04   38.95   <2e-16 ***
## balance      1.05e-06   7.40e-08   14.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2467.9  on 9999  degrees of freedom
## Residual deviance: 2098.5  on 9996  degrees of freedom
## AIC: 12775
## 
## Number of Fisher Scoring iterations: 2

Confusion matrix from Logit Model-2

           Predicted 0   Predicted 1   Total
Actual 0          4542          1025    5567
Actual 1          1993          2440    4433
Total             6535          3465   10000

## Area under the curve: 0.747

Model 3 - churn ~ credit_score + age + tenure + balance + active_member + estimated_salary

## 
## Call:
## glm(formula = churn ~ credit_score + age + tenure + balance + 
##     active_member + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.167  -0.373  -0.173   0.443   1.029  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.09e-01   3.78e-02   -5.53  3.2e-08 ***
## credit_score     -1.30e-04   4.65e-05   -2.79   0.0053 ** 
## age               1.65e-02   4.24e-04   38.86  < 2e-16 ***
## tenure           -3.62e-03   1.57e-03   -2.30   0.0212 *  
## balance           1.06e-06   7.40e-08   14.30  < 2e-16 ***
## active_member     1.08e-02   9.17e-03    1.18   0.2372    
## estimated_salary -1.21e-07   7.97e-08   -1.52   0.1289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2467.9  on 9999  degrees of freedom
## Residual deviance: 2095.1  on 9993  degrees of freedom
## AIC: 12765
## 
## Number of Fisher Scoring iterations: 2

Confusion matrix from Logit Model-3

           Predicted 0   Predicted 1   Total
Actual 0          4537          1030    5567
Actual 1          1983          2450    4433
Total             6520          3480   10000

## Area under the curve: 0.747

Model 4 - churn ~ credit_score + country + gender + age + tenure + balance + active_member + credit_card + estimated_salary

## 
## Call:
## glm(formula = churn ~ credit_score + country + gender + age + 
##     tenure + balance + active_member + credit_card + estimated_salary, 
##     data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.157  -0.371  -0.174   0.443   1.034  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.98e-01   3.90e-02   -5.07    4e-07 ***
## credit_score     -1.29e-04   4.65e-05   -2.78   0.0055 ** 
## country          -3.65e-03   5.54e-03   -0.66   0.5100    
## gender           -1.23e-02   9.20e-03   -1.34   0.1816    
## age               1.65e-02   4.24e-04   38.85   <2e-16 ***
## tenure           -3.67e-03   1.57e-03   -2.34   0.0194 *  
## balance           1.06e-06   7.40e-08   14.30   <2e-16 ***
## active_member     1.06e-02   9.17e-03    1.15   0.2497    
## credit_card      -3.72e-03   1.00e-02   -0.37   0.7114    
## estimated_salary -1.21e-07   7.97e-08   -1.51   0.1300    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2467.9  on 9999  degrees of freedom
## Residual deviance: 2094.6  on 9990  degrees of freedom
## AIC: 12768
## 
## Number of Fisher Scoring iterations: 2

Confusion matrix from Logit Model-4

           Predicted 0   Predicted 1   Total
Actual 0          4539          1028    5567
Actual 1          1977          2456    4433
Total             6516          3484   10000

## Area under the curve: 0.747

Model 5 - churn ~ credit_score + country + gender + age + tenure + balance + active_member + estimated_salary

## 
## Call:
## glm(formula = churn ~ credit_score + country + gender + age + 
##     tenure + balance + active_member + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.158  -0.372  -0.174   0.443   1.033  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.01e-01   3.82e-02   -5.26  1.5e-07 ***
## credit_score     -1.29e-04   4.65e-05   -2.77   0.0056 ** 
## country          -3.63e-03   5.54e-03   -0.66   0.5120    
## gender           -1.23e-02   9.20e-03   -1.33   0.1823    
## age               1.65e-02   4.24e-04   38.86  < 2e-16 ***
## tenure           -3.67e-03   1.57e-03   -2.33   0.0196 *  
## balance           1.06e-06   7.40e-08   14.30  < 2e-16 ***
## active_member     1.06e-02   9.17e-03    1.16   0.2479    
## estimated_salary -1.21e-07   7.97e-08   -1.51   0.1305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2467.9  on 9999  degrees of freedom
## Residual deviance: 2094.6  on 9991  degrees of freedom
## AIC: 12767
## 
## Number of Fisher Scoring iterations: 2

Confusion matrix from Logit Model-5

           Predicted 0   Predicted 1   Total
Actual 0          4541          1026    5567
Actual 1          1980          2453    4433
Total             6521          3479   10000

## Area under the curve: 0.747

Model 6 - churn ~ credit_score + country + gender + age + tenure + balance + estimated_salary

## 
## Call:
## glm(formula = churn ~ credit_score + country + gender + age + 
##     tenure + balance + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.164  -0.373  -0.174   0.444   1.028  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.96e-01   3.80e-02   -5.17  2.4e-07 ***
## credit_score     -1.28e-04   4.65e-05   -2.75   0.0059 ** 
## country          -3.59e-03   5.54e-03   -0.65   0.5163    
## gender           -1.25e-02   9.20e-03   -1.36   0.1738    
## age               1.65e-02   4.24e-04   38.88  < 2e-16 ***
## tenure           -3.68e-03   1.57e-03   -2.34   0.0191 *  
## balance           1.06e-06   7.40e-08   14.29  < 2e-16 ***
## estimated_salary -1.18e-07   7.97e-08   -1.48   0.1399    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2467.9  on 9999  degrees of freedom
## Residual deviance: 2094.9  on 9992  degrees of freedom
## AIC: 12766
## 
## Number of Fisher Scoring iterations: 2

Confusion matrix from Logit Model-6

           Predicted 0   Predicted 1   Total
Actual 0          4532          1035    5567
Actual 1          1978          2455    4433
Total             6510          3490   10000

## Area under the curve: 0.747

Model 7 - churn ~ credit_score + gender + age + tenure + balance + estimated_salary

## 
## Call:
## glm(formula = churn ~ credit_score + gender + age + tenure + 
##     balance + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.166  -0.373  -0.174   0.443   1.030  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.99e-01   3.78e-02   -5.26  1.5e-07 ***
## credit_score     -1.28e-04   4.65e-05   -2.76   0.0058 ** 
## gender           -1.25e-02   9.20e-03   -1.36   0.1748    
## age               1.65e-02   4.23e-04   38.87  < 2e-16 ***
## tenure           -3.66e-03   1.57e-03   -2.33   0.0196 *  
## balance           1.06e-06   7.40e-08   14.30  < 2e-16 ***
## estimated_salary -1.18e-07   7.97e-08   -1.49   0.1372    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2467.9  on 9999  degrees of freedom
## Residual deviance: 2095.0  on 9993  degrees of freedom
## AIC: 12764
## 
## Number of Fisher Scoring iterations: 2

Confusion matrix from Logit Model-7

           Predicted 0   Predicted 1   Total
Actual 0          4527          1040    5567
Actual 1          1978          2455    4433
Total             6505          3495   10000

## Area under the curve: 0.747

Best Model

Model 5 - credit_score, country, gender, age, tenure, balance, active_member, estimated_salary

Based on our criteria, we select Model 5 as the best model. It has comparatively low false negatives (1,980; the seven models range from 1,977 to 1,994) and an area under the curve of 0.747, the same value reported for the other models. We prioritize false negatives because we can accept the model predicting churn for a customer who is actually retained. However, if a customer has actually churned and the model predicts that they are retained, that is a problem, and such classification errors must be highlighted, because it is critical for the bank to identify churned customers correctly. The verdict on the best model is therefore based on which models have lower false negatives.
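The false-negative counts from the seven confusion matrices above can be tabulated side by side to show how close the models are. The sketch below is plain Python rather than the R used in this project; it only restates the reported counts and derives recall for each model.

```python
# (false negatives, true positives) from the seven confusion matrices above.
results = {
    "Model 1": (1994, 2439),
    "Model 2": (1993, 2440),
    "Model 3": (1983, 2450),
    "Model 4": (1977, 2456),
    "Model 5": (1980, 2453),
    "Model 6": (1978, 2455),
    "Model 7": (1978, 2455),
}

for name, (fn, tp) in results.items():
    print(f"{name}: FN={fn} recall={tp / (fn + tp):.3f}")

# Difference between the largest and smallest false-negative count.
fn_counts = [fn for fn, _ in results.values()]
spread = max(fn_counts) - min(fn_counts)
print("FN spread across models:", spread)
```

The spread is small relative to the 4,433 actual churners, which is consistent with the identical AUC values reported for all seven models.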

Chapter 7: Bibliography

Kaemingk, D. (2018, August 29). Reducing customer churn for banks and financial institutions. Qualtrics. Retrieved November 2, 2022, from https://www.qualtrics.com/blog/customer-churn-banking/

Guliyev, H., & Yerdelen Tatoğlu, F. (2021). Customer churn analysis in banking sector: Evidence from explainable machine learning models. Journal of Applied Microeconometrics, 1(2), 85–99. https://doi.org/10.53753/jame.1.2.03